import pandas as pd
# Load the CSV file
file_path = '/Users/JumpMan/Downloads/FM Data Analytics/Feyanoord/Moneyball Scouting/2023-24/2023-24 Eredivese.csv'
data = pd.read_csv(file_path)
# Check the first few rows
print("First few rows of the dataset:")
data.head()
First few rows of the dataset:
| | Name | Position | Age | Height | Weight | Club | Division | Nationality | Home-Grown | Personality | ... | Attacking Midfielder | Creative Winger | Attacking Winger | Creative Forward | Attacking Forward | Finisher | Aerial Threat | Reader | Assister | Ball Winning Defenders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Luuk de Jong | ST (C) | 33 | 6'2" | 185 lbs | PSV | Eredivisie | NED (SUI) | Trained in nation (15-21) | Fairly Professional | ... | 84 | 66 | 75 | 78 | 81 | 97 | 64 | 25 | 46 | 12 |
| 1 | Sergiño Dest | D/WB (RL), M (R) | 23 | 5'7" | 136 lbs | PSV | Eredivisie | USA (NED) | Trained in nation (15-21) | Balanced | ... | 48 | 67 | 53 | 49 | 46 | 35 | 63 | 83 | 55 | 71 |
| 2 | Seiya Maikuma | D/WB/AM (R), ST (C) | 26 | 5'10" | 152 lbs | AZ | Eredivisie | JPN | - | Balanced | ... | 93 | 91 | 85 | 78 | 87 | 82 | 64 | 48 | 78 | 31 |
| 3 | Bas Dost | ST (C) | 34 | 6'5" | 180 lbs | FC Groningen | Eredivisie | NED | Trained in nation (15-21) | Spirited | ... | 75 | 52 | 58 | 64 | 74 | 98 | 83 | 32 | 49 | 55 |
| 4 | Calvin Stengs | M (R), AM (RC) | 25 | 6'0" | 149 lbs | Feyenoord | Eredivisie | NED (SUR) | Trained in nation (15-21) | Fairly Loyal | ... | 98 | 100 | 99 | 98 | 99 | 82 | 57 | 45 | 96 | 10 |
5 rows × 210 columns
# Overview of the dataset
print("\nDataset Information:")
data.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 952 entries, 0 to 951
Columns: 210 entries, Name to Ball Winning Defenders
dtypes: float64(2), int64(194), object(14)
memory usage: 1.5+ MB
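The preview shows Height and Weight stored as text (6'2", 185 lbs), which is why they fall among the object columns rather than the numeric ones. Below is a minimal sketch of converting them to numbers, assuming every value follows the two formats visible in the first five rows. The helper names height_to_cm and weight_to_kg are hypothetical and not part of the notebook.

```python
import re

# Assumed formats, taken from the preview rows: 6'2" and 185 lbs
HEIGHT_PATTERN = re.compile(r"(\d+)'(\d+)\"")
WEIGHT_PATTERN = re.compile(r"(\d+)\s*lbs")

def height_to_cm(text):
    """Convert a feet-and-inches string such as 6'2" to centimetres."""
    match = HEIGHT_PATTERN.fullmatch(text.strip())
    if match is None:
        return None  # unexpected format: leave it for manual inspection
    feet, inches = int(match.group(1)), int(match.group(2))
    return round((feet * 12 + inches) * 2.54, 1)

def weight_to_kg(text):
    """Convert a pounds string such as '185 lbs' to kilograms."""
    match = WEIGHT_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    return round(int(match.group(1)) * 0.45359237, 1)

print(height_to_cm("6'2\""))    # 188.0
print(weight_to_kg("185 lbs"))  # 83.9
```

In the notebook these could be applied with data['Height'].map(height_to_cm) and data['Weight'].map(weight_to_kg), provided no rows use a different format.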
# Summary statistics for numerical columns
print("\nSummary Statistics for Numerical Columns:")
data.describe()
Summary Statistics for Numerical Columns:
| | Age | Starts | Minutes Played | Average Rating | Sub Appearances | Minutes/Game | Goals (percentile) | Goals/90 (percentile) | Minutes/Goal (percentile) | xG (percentile) | ... | Attacking Midfielder | Creative Winger | Attacking Winger | Creative Forward | Attacking Forward | Finisher | Aerial Threat | Reader | Assister | Ball Winning Defenders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | ... | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 | 952.000000 |
| mean | 21.017857 | 6.421218 | 576.169118 | 2.915504 | 2.026261 | 25.387605 | 8.726891 | 10.797269 | 11.134454 | 17.646008 | ... | 22.147059 | 22.161765 | 21.953782 | 21.967437 | 21.962185 | 21.278361 | 21.207983 | 21.887605 | 20.227941 | 21.697479 |
| std | 4.743549 | 11.149500 | 966.876113 | 3.363326 | 4.187803 | 33.643253 | 23.209986 | 24.582441 | 24.847649 | 29.408106 | ... | 31.379590 | 31.397574 | 31.517758 | 31.524697 | 31.523107 | 31.635164 | 31.861009 | 31.521945 | 32.042879 | 31.118391 |
| min | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 20.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 23.000000 | 7.250000 | 781.250000 | 6.730000 | 2.000000 | 52.222500 | 0.000000 | 0.000000 | 0.000000 | 29.000000 | ... | 44.000000 | 44.000000 | 43.250000 | 44.000000 | 43.250000 | 44.000000 | 43.250000 | 44.000000 | 43.000000 | 35.000000 |
| max | 39.000000 | 43.000000 | 3870.000000 | 7.520000 | 22.000000 | 90.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | ... | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
8 rows × 196 columns
# Check for any missing values
print("\nMissing Values in Each Column:")
data.isnull().sum()
Missing Values in Each Column:
Name 0
Position 0
Age 0
Height 0
Weight 0
..
Finisher 0
Aerial Threat 0
Reader 0
Assister 0
Ball Winning Defenders 0
Length: 210, dtype: int64
# Distribution of values in key categorical columns, like Position and Club
print("\nUnique Values in 'Position' Column:")
data['Position'].value_counts()
Unique Values in 'Position' Column:
Position
GK 114
D (C) 87
ST (C) 62
DM, M (C) 51
M/AM (C) 46
...
D/WB (R), DM, M (RC), AM (R) 1
D (RC), WB (R), DM, M (RC) 1
M (LC), AM (C) 1
D (RL), WB/M (R) 1
D/WB/AM (R), ST (C) 1
Name: count, Length: 127, dtype: int64
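With 127 distinct position strings, grouping players by the raw Position value splits them very finely. Below is a rough sketch of breaking a string into (role, sides) pairs, assuming the pattern seen in the output above: comma-separated groups, roles joined by "/", and optional sides in parentheses. The function name parse_position is hypothetical.

```python
import re

# One group looks like "D/WB (RL)" or just "DM"; sides are optional
GROUP_PATTERN = re.compile(r"([A-Z/]+)(?:\s*\(([RLC]+)\))?")

def parse_position(text):
    """Split a position string into a list of (role, sides) tuples."""
    pairs = []
    for group in text.split(","):
        match = GROUP_PATTERN.fullmatch(group.strip())
        if match is None:
            continue  # skip anything that does not fit the assumed pattern
        roles, sides = match.group(1), match.group(2) or ""
        for role in roles.split("/"):
            pairs.append((role, sides))
    return pairs

print(parse_position("D/WB (RL), M (R)"))
# [('D', 'RL'), ('WB', 'RL'), ('M', 'R')]
print(parse_position("GK"))
# [('GK', '')]
```

The resulting pairs can then be used to filter, for example, every player who can play ST regardless of their other listed roles.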
print("\nUnique Values in 'Club' Column:")
data['Club'].value_counts()
Unique Values in 'Club' Column:
Club
Feyenoord 89
PSV 68
sc Heerenveen 58
PEC Zwolle 58
FC Groningen 56
Go Ahead Eagles 55
N.E.C. Nijmegen 52
NAC Breda 49
Willem II 48
FC Volendam 48
Excelsior 46
Fortuna Sittard 42
FC Utrecht 42
Vitesse 38
Sparta Rotterdam 35
AZ 34
RKC Waalwijk 33
Almere City 32
Heracles Almelo 32
FC Twente 30
Ajax Amateurs 6
Ajax 1
Name: count, dtype: int64
# Print all column names
print("Column Names:")
for col in data.columns:
    print(col)
Column Names: Name Position Age Height Weight Club Division Nationality Home-Grown Personality Media Handling Wage Transfer Value Asking Price Preferred Foot Starts Minutes Played Average Rating Sub Appearances Minutes/Game Goals (percentile) Goals/90 (percentile) Minutes/Goal (percentile) xG (percentile) xG/90 (percentile) xG/Shot (percentile) xG Overperformance (percentile) xG Overperformance/90 (percentile) Non-pen Goals (percentile) Non-pen Goals/90 (percentile) Non-pen Goals/Shot (percentile) Minutes/Non-pen Goal (percentile) Non-pen xG (percentile) Non-pen xG/90 (percentile) Non-pen Goals - Non-pens xG /90 (percentile) Non-pen xG/Shot (percentile) Non-pen xG Overperformance (percentile) Non-pen xG Overperformance/90 (percentile) Goals Outside Box (percentile) Goals Outside Box/90 (percentile) Assists (percentile) Assists/90 (percentile) Minutes/Assist (percentile) xA (percentile) xA/90 (percentile) xA Overperformance (percentile) xA Overperformance/90 (percentile) Assists/Clear Cut Chances Created (percentile) Goal Contributions (percentile) Goal Contributions/90 (percentile) xGC (percentile) xGC/90 (percentile) xGC Overperformance (percentile) xGC Overperformance/90 (percentile) Non-pen Goal Contributions (percentile) Non-pen Goal Contributions/90 (percentile) Non-pen xGC (percentile) Non-pen xGC/90 (percentile) Non-pen xGC Overperformance (percentile) Non-pen xGC Overperformance/90 (percentile) Conversion % (percentile) Shots (percentile) Shots/90 (percentile) Shots on Target (percentile) Shots on Target/90 (percentile) Shots on Target % (percentile) Shots Outside Box/90 (percentile) Passes Attempted (percentile) Passes Attempted/90 (percentile) Passes Completed (percentile) Passes Completed/90 (percentile) Pass Completion % (percentile) Progressive Passes/90 (percentile) Progressive Passes (percentile) Progressive Pass Rate (percentile) Key Passes (percentile) Key Passes/90 (percentile) Key Pass % (percentile) Open Play Key Passes (percentile) Open Play 
Key Passes/90 (percentile) Open Play Key Pass % (percentile) Crosses Attempted (percentile) Crosses Attempted/90 (percentile) Crosses Completed (percentile) Crosses Completed/90 (percentile) Crosses Completed % (percentile) Open Play Crosses Attempted (percentile) Open Play Crosses Attempted/90 (percentile) Open Play Crosses Completed (percentile) Open Play Crosses Completed/90 (percentile) Open Play Cross Completion % (percentile) Chances Created (percentile) Chances Created/90 (percentile) Clear Cut Chances Created (percentile) Clear Cut Chances Created/90 (percentile) Pressures Attempted (percentile) Pressures Attempted/90 (percentile) Pressures Completed (percentile) Pressures Completed/90 (percentile) Pressure Success % (percentile) Possession Won/90 (percentile) Possession Lost/90 (percentile) Poss+-/90 (percentile) Poss+- % (percentile) Dribbles/90 (percentile) Dribbles (percentile) Penalties Taken (percentile) Penalties Scored (percentile) Pens Scored % (percentile) Tackles Attempted (percentile) Tackles Attempted/90 (percentile) Tackles Completed (percentile) Tackles Completed/90 (percentile) Tackles Failed (percentile) Tackle Completion % (percentile) Tackles Failed/90 (percentile) Key Tackles (percentile) Key Tackles/90 (percentile) Tackle Quality (percentile) Interceptions (percentile) Interceptions/90 (percentile) Blocks (percentile) Blocks/90 (percentile) Shots Blocked (percentile) Shots Blocked/90 (percentile) Headers Attempted (percentile) Headers Attempted/90 (percentile) Headers Won (percentile) Headers Won/90 (percentile) Headers Won % (percentile) Headers Lost (percentile) Headers Lost/90 (percentile) Headers Lost % (percentile) Key Headers (percentile) Key Headers/90 (percentile) Aerial Challenges Attempted/90 (percentile) Duels Win % (percentile) Fouls Against (percentile) Fouls Made (percentile) Net Fouls (percentile) Fouls Won/90 (percentile) Fouls Committed/90 (percentile) Clearances (percentile) Clearances/90 (percentile) Offsides 
(percentile) Offsides/90 (percentile) Offside/Non-pen Goals (percentile) Offside/Non-pen xG (percentile) Distance Covered/90 (percentile) Distance Covered (percentile) Total Saves (percentile) Saves/90 (percentile) Save % (percentile) xSave % (percentile) xSave % Overperformance (percentile) Saves Held (percentile) Saves Held/90 (percentile) Saves Held Ratio (percentile) Saves Held/Shots Faced Ratio (percentile) Saves Tipped (percentile) Saves Tipped/90 (percentile) Saves Tipped Ratio (percentile) Saves Tipped/Shots Faced Ratio (percentile) Saves Parried (percentile) Saves Parried/90 (percentile) Saves Parried Ratio (percentile) Saves Parried/Shots Faced Ratio (percentile) Saves/Goal Conceaded (percentile) Save Efficiency (percentile) Shots on Target Against (percentile) Shots on Target Against/90 (percentile) xGP (percentile) xGP/90 (percentile) Penalties Faced (percentile) Penalties Saved (percentile) Pens Saved % (percentile) Goals Conceded (percentile) Conceded/90 (percentile) Clean Sheets (percentile) Clean Sheet Ratio (percentile) Red Cards (percentile) Yellow Cards (percentile) Yellows/Tackle (percentile) Reds/Tackle (percentile) Yellows/90 (percentile) Reds/90 (percentile) Player of the Match (percentile) Mistakes Leading to Goal (percentile) Sprints/90 (percentile) Attacking Actions/90 (percentile) Creative Actions/90 (percentile) Defensive Actions/90 (percentile) Goalkeeping Actions/90 (percentile) Excitement Factor/90 (percentile) General Performance Goalkeeping Defensive Defender Creative Defender Attacking Defender Creative Midfielder Attacking Midfielder Creative Winger Attacking Winger Creative Forward Attacking Forward Finisher Aerial Threat Reader Assister Ball Winning Defenders
# Check for duplicate column names
duplicate_columns = data.columns[data.columns.duplicated()].tolist()
if duplicate_columns:
    print("Duplicate columns found:", duplicate_columns)
    # Drop duplicate columns, keeping the first occurrence
    data = data.loc[:, ~data.columns.duplicated()]
# Columns to keep in addition to the metric
additional_columns = ['Name', 'Position', 'Age', 'Height', 'Weight', 'Club',
'Division', 'Nationality', 'Personality', 'Media Handling',
'Wage', 'Transfer Value', 'Asking Price', 'Preferred Foot', 'Minutes Played']
# Function to get the top players for a specific metric with a minimum Minutes Played filter
def get_top_players(metric, min_minutes=500, top_n=5):
    if metric not in data.columns:
        print(f"Metric '{metric}' not found in data.")
        return None
    if 'Minutes Played' not in data.columns:
        print("Column 'Minutes Played' not found in data.")
        return None
    # Filter players based on the Minutes Played threshold
    filtered_data = data[data['Minutes Played'] > min_minutes]
    # Select the additional columns and the specified metric, then get the top players
    top_players = filtered_data[additional_columns + [metric]].sort_values(by=metric, ascending=False).head(top_n)
    return top_players
# Example usage
metric = 'Non-pen Goals (percentile)' # Replace with the metric of your choice
min_minutes = 500 # Set the minimum minutes played
top_n = 5 # Number of top players to return
top_players_df = get_top_players(metric, min_minutes=min_minutes, top_n=top_n)
# Display the resulting DataFrame
top_players_df
| | Name | Position | Age | Height | Weight | Club | Division | Nationality | Personality | Media Handling | Wage | Transfer Value | Asking Price | Preferred Foot | Minutes Played | Non-pen Goals (percentile) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Santiago Giménez | ST (C) | 23 | 6'0" | 152 lbs | Feyenoord | Eredivisie | MEX (ARG) | Fairly Determined | Level-headed | £21,500 p/w | £54M - £60M | - | Left | 2811 | 100 |
| 3 | Bas Dost | ST (C) | 34 | 6'5" | 180 lbs | FC Groningen | Eredivisie | NED | Spirited | Media-friendly | £3,600 p/w | £160K - £1.6M | - | Right | 2882 | 100 |
| 0 | Luuk de Jong | ST (C) | 33 | 6'2" | 185 lbs | PSV | Eredivisie | NED (SUI) | Fairly Professional | Level-headed | £41,000 p/w | Not for Sale | - | Right | 3203 | 98 |
| 221 | Victor Edvardsen | AM (RL), ST (C) | 28 | 6'1" | 191 lbs | Go Ahead Eagles | Eredivisie | SWE | Balanced | Media-friendly | £2,800 p/w | £550K - £6.2M | - | Right Only | 2951 | 98 |
| 290 | Kevin van Kippersluis | M/AM (RLC), ST (C) | 30 | 6'1" | 163 lbs | Ajax Amateurs | Dutch Vierde Divisie A | NED (GER) | Balanced | Media-friendly | £0 p/w | £0 | - | Left Only | 3870 | 98 |
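The last row of the result comes from the Dutch Vierde Divisie A (Ajax Amateurs), so the file evidently contains players outside the Eredivisie. If the ranking is meant to cover one competition only, a division filter can be applied alongside the minutes threshold. The sketch below uses a small made-up DataFrame with the same column names; the function name top_players_in_division is hypothetical.

```python
import pandas as pd

def top_players_in_division(df, metric, division, min_minutes=500, top_n=5):
    """Rank players on one metric within a single division."""
    mask = (df["Division"] == division) & (df["Minutes Played"] > min_minutes)
    return df.loc[mask].sort_values(by=metric, ascending=False).head(top_n)

# Toy data for illustration only; names and values are invented
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Division": ["Eredivisie", "Dutch Vierde Divisie A", "Eredivisie"],
    "Minutes Played": [2811, 3870, 400],
    "Non-pen Goals (percentile)": [100, 98, 90],
})

result = top_players_in_division(toy, "Non-pen Goals (percentile)", "Eredivisie")
print(result["Name"].tolist())  # ['Player A']
```

Player B is excluded by division and Player C by the minutes threshold, which mirrors the two filters the notebook would need for an Eredivisie-only ranking.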
This code creates an interactive tool for analyzing football player statistics using pandas and ipywidgets in a Jupyter notebook environment. Here's a breakdown of its functionality:
The tool uses several ipywidgets to build the interface:
- metric_select: A multiple-selection widget for choosing which metrics to analyze. It was designed to list only numeric columns that end with '(percentile)'; in the cell below that filter is removed for troubleshooting, so every float64 or int64 column is offered.
- min_minutes_slider: A slider that sets the minimum number of minutes a player must have played to be included in the analysis.
- top_n_slider: A slider that sets how many top players to display in the results.
- player_search, team_dropdown, league_dropdown, nationality_dropdown: A text search on Name plus dropdown filters on Club, Division, and Nationality.
The create_metric_inputs function dynamically generates a pair of Min and Max input fields (defaulting to 0 and 100) for each selected metric.
The get_top_players_interactive function is the core of the analysis. It applies the name, team, league, and nationality filters, keeps players with more than the minimum minutes, applies each metric's Min/Max range, sorts by the first selected metric in descending order, and displays the top N rows.
The code uses widgets.interactive_output to re-run that function whenever the user changes the selected metrics, minimum minutes, number of top players, or any of the search and dropdown filters. The Min/Max fields are not passed to interactive_output, so editing them alone does not refresh the results; the new range takes effect the next time one of the connected widgets changes.
The final layout combines all widgets into a vertical box (VBox).
This tool makes it quick to identify top performers on any chosen metric under a given set of criteria.
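The description mentions restricting the metric list to numeric columns that end with '(percentile)'. Below is a minimal sketch of building that option list, shown on a made-up DataFrame. The function name percentile_metric_options is hypothetical, and the result would be passed as the options argument of the selection widget.

```python
import pandas as pd

def percentile_metric_options(df):
    """Return numeric column names that end with '(percentile)'."""
    numeric_cols = df.select_dtypes(include="number").columns
    return [col for col in numeric_cols if col.endswith("(percentile)")]

# Toy frame for illustration only
toy = pd.DataFrame({
    "Name": ["A"],
    "Age": [23],
    "Goals (percentile)": [95],
    "xG (percentile)": [98.0],
})

options = percentile_metric_options(toy)
print(options)  # ['Goals (percentile)', 'xG (percentile)']
```

Using select_dtypes(include="number") also picks up numeric dtypes other than float64 and int64, which a comparison against those two names alone would miss.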
import pandas as pd
from IPython.display import display
import ipywidgets as widgets
# Load your data
data = pd.read_csv('/Users/JumpMan/Downloads/FM Data Analytics/Feyanoord/Moneyball Scouting/2023-24/2023-24 Eredivese.csv')
# Check for duplicate column names
duplicate_columns = data.columns[data.columns.duplicated()].tolist()
if duplicate_columns:
    print("Duplicate columns found:", duplicate_columns)
    # Drop duplicate columns, keeping the first occurrence
    data = data.loc[:, ~data.columns.duplicated()]
# Columns to keep in addition to the metrics
additional_columns = ['Name', 'Position', 'Age', 'Height', 'Weight', 'Club',
'Division', 'Nationality', 'Personality', 'Media Handling',
'Wage', 'Transfer Value', 'Asking Price', 'Preferred Foot', 'Minutes Played']
# Define the widget elements without the "(percentile)" filter for troubleshooting
metric_select = widgets.SelectMultiple(
options=[col for col in data.columns if data[col].dtype in ['float64', 'int64']],
description='Metrics',
disabled=False,
layout=widgets.Layout(width='50%', height='300px')
)
min_minutes_slider = widgets.IntSlider(value=500, min=0, max=3000, step=100, description='Min Minutes')
top_n_slider = widgets.IntSlider(value=5, min=1, max=20, description='Top N')
# Search input for player name
player_search = widgets.Text(description='Player Name', placeholder='Type player name here')
# Dropdown filters for categorical fields
team_dropdown = widgets.Dropdown(
options=[''] + sorted(data['Club'].dropna().unique().tolist()),
description='Team'
)
league_dropdown = widgets.Dropdown(
options=[''] + sorted(data['Division'].dropna().unique().tolist()),
description='League'
)
nationality_dropdown = widgets.Dropdown(
options=[''] + sorted(data['Nationality'].dropna().unique().tolist()),
description='Nationality'
)
# Dictionary to store metric input fields
metric_inputs = {}
# Function to create metric input fields
def create_metric_inputs(change):
    # Clear existing widgets in metric_inputs
    for metric, inputs in metric_inputs.items():
        inputs[0].close()
        inputs[1].close()
    metric_inputs.clear()
    # Update the VBox children with the new selected metrics
    inputs_vbox.children = []
    for metric in change['new']:
        min_input = widgets.FloatText(value=0, description='Min', step=1)
        max_input = widgets.FloatText(value=100, description='Max', step=1)
        metric_inputs[metric] = (min_input, max_input)
        inputs_vbox.children += (widgets.HBox([widgets.Label(metric), min_input, max_input]),)
# Create a VBox to hold the metric input fields
inputs_vbox = widgets.VBox([])
# Function to get the top players based on selected metrics, filters, and search criteria
def get_top_players_interactive(selected_metrics, min_minutes, top_n, player_name, team, league, nationality):
    # Start with the full dataset
    filtered_data = data
    # Filter by player name if provided
    if player_name:
        filtered_data = filtered_data[filtered_data['Name'].str.contains(player_name, case=False, na=False)]
        if filtered_data.empty:
            print(f"No player found with the name '{player_name}'.")
            return None
    # Filter by team if selected
    if team:
        filtered_data = filtered_data[filtered_data['Club'] == team]
        if filtered_data.empty:
            print(f"No players found for the team '{team}'.")
            return None
    # Filter by league if selected
    if league:
        filtered_data = filtered_data[filtered_data['Division'] == league]
        if filtered_data.empty:
            print(f"No players found in the league '{league}'.")
            return None
    # Filter by nationality if selected
    if nationality:
        filtered_data = filtered_data[filtered_data['Nationality'] == nationality]
        if filtered_data.empty:
            print(f"No players found from the nationality '{nationality}'.")
            return None
    # Further filter by minimum minutes played
    filtered_data = filtered_data[filtered_data['Minutes Played'] > min_minutes]
    # Apply min/max filters for each metric
    for metric, (min_input, max_input) in metric_inputs.items():
        filtered_data = filtered_data[
            (filtered_data[metric] >= min_input.value) &
            (filtered_data[metric] <= max_input.value)
        ]
    if filtered_data.empty:
        print("No players match all criteria. Try adjusting your filters.")
        return None
    # Sort by the first selected metric
    if selected_metrics:
        sorted_data = filtered_data.sort_values(by=selected_metrics[0], ascending=False)
    else:
        sorted_data = filtered_data
    # Select top N players
    top_players = sorted_data.head(top_n)
    # Display only the selected metrics and additional columns
    columns_to_display = additional_columns + list(selected_metrics)
    display(top_players[columns_to_display])
    return top_players
# Connect the metric selection to input field creation
metric_select.observe(create_metric_inputs, names='value')
# Set up widgets for interactive functionality
interactive_output = widgets.interactive_output(
get_top_players_interactive,
{'selected_metrics': metric_select, 'min_minutes': min_minutes_slider, 'top_n': top_n_slider,
'player_name': player_search, 'team': team_dropdown, 'league': league_dropdown, 'nationality': nationality_dropdown}
)
# Combine all widgets into the final layout
final_widget = widgets.VBox([
widgets.HBox([metric_select, widgets.VBox([min_minutes_slider, top_n_slider])]),
widgets.HBox([player_search, team_dropdown, league_dropdown, nationality_dropdown]),
inputs_vbox,
interactive_output
])
display(final_widget)
VBox(children=(HBox(children=(SelectMultiple(description='Metrics', layout=Layout(height='300px', width='50%')…
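The Min/Max filtering inside get_top_players_interactive reads widget values directly, which makes it awkward to check outside the notebook. Below is a sketch of the same inclusive range logic as a plain function, assuming the ranges arrive as a dict mapping each metric to a (min, max) tuple. The name apply_metric_ranges and the toy data are hypothetical.

```python
import pandas as pd

def apply_metric_ranges(df, ranges):
    """Keep rows whose value for every metric lies within [low, high]."""
    mask = pd.Series(True, index=df.index)
    for metric, (low, high) in ranges.items():
        # between() is inclusive on both ends by default, matching >= and <=
        mask &= df[metric].between(low, high)
    return df[mask]

# Toy data for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Goals (percentile)": [95, 40, 80],
    "xA (percentile)": [10, 90, 70],
})

kept = apply_metric_ranges(
    toy,
    {"Goals (percentile)": (50, 100), "xA (percentile)": (50, 100)},
)
print(kept["Name"].tolist())  # ['C']
```

Inside the widget callback, the dict could be built from the existing inputs with {m: (lo.value, hi.value) for m, (lo, hi) in metric_inputs.items()}.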
import pandas as pd
from IPython.display import display
import ipywidgets as widgets
# Load your data
data_europe = pd.read_csv('/Users/JumpMan/Downloads/FM Data Analytics/Feyanoord/Moneyball Scouting/2023-24/Top 7 leagues in europe.html.csv')
# Check for duplicate column names
duplicate_columns = data_europe.columns[data_europe.columns.duplicated()].tolist()
if duplicate_columns:
    print("Duplicate columns found:", duplicate_columns)
    # Drop duplicate columns, keeping the first occurrence
    data_europe = data_europe.loc[:, ~data_europe.columns.duplicated()]
# Columns to keep in addition to the metrics
additional_columns = ['Name', 'Position', 'Age', 'Height', 'Weight', 'Club',
'Division', 'Nationality', 'Personality', 'Media Handling',
'Wage', 'Transfer Value', 'Asking Price', 'Preferred Foot', 'Minutes Played']
# Define the widget elements without the "(percentile)" filter for troubleshooting
metric_select = widgets.SelectMultiple(
options=[col for col in data_europe.columns if data_europe[col].dtype in ['float64', 'int64']],
description='Metrics',
disabled=False,
layout=widgets.Layout(width='50%', height='300px')
)
min_minutes_slider = widgets.IntSlider(value=500, min=0, max=3000, step=100, description='Min Minutes')
top_n_slider = widgets.IntSlider(value=5, min=1, max=20, description='Top N')
# Search input for player name
player_search = widgets.Text(description='Player Name', placeholder='Type player name here')
# Dropdown filters for categorical fields
team_dropdown = widgets.Dropdown(
options=[''] + sorted(data_europe['Club'].dropna().unique().tolist()),
description='Team'
)
league_dropdown = widgets.Dropdown(
options=[''] + sorted(data_europe['Division'].dropna().unique().tolist()),
description='League'
)
nationality_dropdown = widgets.Dropdown(
options=[''] + sorted(data_europe['Nationality'].dropna().unique().tolist()),
description='Nationality'
)
# Dictionary to store metric input fields
metric_inputs = {}
# Function to create metric input fields
def create_metric_inputs(change):
    # Clear existing widgets in metric_inputs
    for metric, inputs in metric_inputs.items():
        inputs[0].close()
        inputs[1].close()
    metric_inputs.clear()
    # Update the VBox children with the new selected metrics
    inputs_vbox.children = []
    for metric in change['new']:
        min_input = widgets.FloatText(value=0, description='Min', step=1)
        max_input = widgets.FloatText(value=100, description='Max', step=1)
        metric_inputs[metric] = (min_input, max_input)
        inputs_vbox.children += (widgets.HBox([widgets.Label(metric), min_input, max_input]),)
# Create a VBox to hold the metric input fields
inputs_vbox = widgets.VBox([])
# Function to get the top players based on selected metrics, filters, and search criteria
def get_top_players_interactive(selected_metrics, min_minutes, top_n, player_name, team, league, nationality):
    # Start with the full dataset
    filtered_data = data_europe
    # Filter by player name if provided
    if player_name:
        filtered_data = filtered_data[filtered_data['Name'].str.contains(player_name, case=False, na=False)]
        if filtered_data.empty:
            print(f"No player found with the name '{player_name}'.")
            return None
    # Filter by team if selected
    if team:
        filtered_data = filtered_data[filtered_data['Club'] == team]
        if filtered_data.empty:
            print(f"No players found for the team '{team}'.")
            return None
    # Filter by league if selected
    if league:
        filtered_data = filtered_data[filtered_data['Division'] == league]
        if filtered_data.empty:
            print(f"No players found in the league '{league}'.")
            return None
    # Filter by nationality if selected
    if nationality:
        filtered_data = filtered_data[filtered_data['Nationality'] == nationality]
        if filtered_data.empty:
            print(f"No players found from the nationality '{nationality}'.")
            return None
    # Further filter by minimum minutes played
    filtered_data = filtered_data[filtered_data['Minutes Played'] > min_minutes]
    # Apply min/max filters for each metric
    for metric, (min_input, max_input) in metric_inputs.items():
        filtered_data = filtered_data[
            (filtered_data[metric] >= min_input.value) &
            (filtered_data[metric] <= max_input.value)
        ]
    if filtered_data.empty:
        print("No players match all criteria. Try adjusting your filters.")
        return None
    # Sort by the first selected metric
    if selected_metrics:
        sorted_data = filtered_data.sort_values(by=selected_metrics[0], ascending=False)
    else:
        sorted_data = filtered_data
    # Select top N players
    top_players = sorted_data.head(top_n)
    # Display only the selected metrics and additional columns
    columns_to_display = additional_columns + list(selected_metrics)
    display(top_players[columns_to_display])
    return top_players
# Connect the metric selection to input field creation
metric_select.observe(create_metric_inputs, names='value')
# Set up widgets for interactive functionality
interactive_output = widgets.interactive_output(
get_top_players_interactive,
{'selected_metrics': metric_select, 'min_minutes': min_minutes_slider, 'top_n': top_n_slider,
'player_name': player_search, 'team': team_dropdown, 'league': league_dropdown, 'nationality': nationality_dropdown}
)
# Combine all widgets into the final layout
final_widget = widgets.VBox([
widgets.HBox([metric_select, widgets.VBox([min_minutes_slider, top_n_slider])]),
widgets.HBox([player_search, team_dropdown, league_dropdown, nationality_dropdown]),
inputs_vbox,
interactive_output
])
display(final_widget)
VBox(children=(HBox(children=(SelectMultiple(description='Metrics', layout=Layout(height='300px', width='50%')…
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
def dynamic_clustering(data, features, n_clusters):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    # Select the specified features from the data
    data_selected = data[features]
    # Normalize the data using StandardScaler
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data_selected)
    # Apply KMeans clustering; n_init is set explicitly to avoid the FutureWarning
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(data_scaled)
    # Add cluster assignments to the data
    data['Cluster'] = clusters
    # Apply PCA for dimensionality reduction
    pca = PCA(n_components=2)
    pca_components = pca.fit_transform(data_scaled)
    # Add PCA components to the data
    data['PCA1'] = pca_components[:, 0]
    data['PCA2'] = pca_components[:, 1]
    # Project the KMeans centroids (fitted in scaled feature space) into the 2-D PCA space
    pca_centroids = pca.transform(kmeans.cluster_centers_)
    # Convert centroids to a DataFrame and use PCA1 and PCA2 for the columns
    centroids = pd.DataFrame(pca_centroids, columns=['PCA1', 'PCA2'])
    return data, centroids
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd
# Define attacking metrics
attacking_metrics = [
'Goals (percentile)', 'Goals/90 (percentile)', 'Minutes/Goal (percentile)', 'xG (percentile)',
'xG/90 (percentile)', 'xG/Shot (percentile)', 'Non-pen Goals (percentile)', 'Non-pen Goals/90 (percentile)',
'Non-pen Goals/Shot (percentile)', 'Non-pen xG (percentile)', 'Non-pen xG/90 (percentile)', 'Shots (percentile)',
'Shots/90 (percentile)', 'Shots on Target (percentile)', 'Shots on Target/90 (percentile)', 'Shots on Target % (percentile)',
'Shots Outside Box/90 (percentile)', 'Goal Contributions (percentile)', 'Goal Contributions/90 (percentile)',
'Goals Outside Box (percentile)', 'Goals Outside Box/90 (percentile)', 'Conversion % (percentile)',
'Penalties Taken (percentile)', 'Penalties Scored (percentile)', 'Pens Scored % (percentile)'
]
# Call the dynamic clustering function for attacking metrics
attacking_clustered, attacking_centroids = dynamic_clustering(
data=data_europe, # Updated variable name
features=attacking_metrics,
n_clusters=4
)
# Calculate a combined score from the attacking metrics (average in this case)
attacking_clustered['Combined Score'] = attacking_clustered[attacking_metrics].mean(axis=1)
# Sort players by combined score to get the top performers
top_5_attacking = attacking_clustered.nlargest(5, 'Combined Score')
# Display the results
print("Top 5 Attacking Players based on combined score:")
display(top_5_attacking)
# Interactive Plotly Visualization for Attacking Metrics
fig1 = px.scatter(
attacking_clustered,
x='PCA1',
y='PCA2',
color='Cluster',
symbol='Division', # Use the correct column name for league
size='Minutes Played',
hover_data=['Name', 'Age', 'Cluster', 'Position', 'Minutes Played', 'Division'],
title='Interactive Attacking Player Clustering'
)
fig1.update_layout(
title_x=0.5, # Center the title
width=900,
height=600,
legend=dict(
x=1.05, # Adjust horizontal position (right of the plot)
y=1, # Adjust vertical position (top of the plot)
xanchor="left", # Anchor legend box to the left
yanchor="top", # Anchor legend box to the top
title=dict(text="Cluster")
),
paper_bgcolor='rgb(243, 243, 243)', # Light background for clean look
plot_bgcolor='rgba(0,0,0,0)' # Transparent plot background
)
# Show the Plot
#fig1.show() # Uncomment to show the plot
# Display Cluster Centroids for Attacking Metrics
print("Attacking Cluster Centroids:")
display(attacking_centroids)
/Users/JumpMan/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Top 5 Attacking Players based on combined score:
| | Name | Position | Age | Height | Weight | Inf | Club | Division | Nationality | Home-Grown | ... | Attacking Forward | Finisher | Aerial Threat | Reader | Assister | Ball Winning Defenders | Cluster | PCA1 | PCA2 | Combined Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Erling Haaland | ST (C) | 23 | 6'5" | 205 lbs | - | Man City | English Premier Division | NOR (ENG) | - | ... | 97 | 99 | 76 | 37 | 91 | 52 | 3 | 14.533494 | 12.559919 | 91.64 |
| 6 | Harry Kane | AM (C), ST (C) | 30 | 6'2" | 189 lbs | - | FC Bayern | Bundesliga | ENG (IRL) | - | ... | 99 | 99 | 77 | 36 | 86 | 53 | 3 | 14.272450 | 10.437637 | 89.92 |
| 0 | Kylian Mbappé | AM (RL), ST (C) | 25 | 5'10" | 160 lbs | - | R. Madrid | Spanish First Division | FRA | - | ... | 100 | 99 | 46 | 29 | 97 | 37 | 3 | 14.140771 | 10.753558 | 89.00 |
| 153 | Victor Boniface | ST (C) | 23 | 6'3" | 200 lbs | - | Bayer 04 | Bundesliga | NGA | - | ... | 83 | 97 | 74 | 29 | 50 | 28 | 3 | 13.979868 | 11.647960 | 88.92 |
| 55 | Romelu Lukaku | ST (C) | 31 | 6'3" | 205 lbs | - | Parthenope | Italian Serie A | BEL (COD) | - | ... | 95 | 98 | 70 | 32 | 96 | 13 | 3 | 13.753185 | 10.052220 | 88.08 |
5 rows × 215 columns
Attacking Cluster Centroids:
| | PCA1 | PCA2 |
|---|---|---|
| 0 | -2.588840 | 0.308740 |
| 1 | 7.441651 | -0.862183 |
| 2 | 1.224423 | -0.907023 |
| 3 | 11.167452 | 8.361225 |
Attacking Score Search
def search_player_stats(player_name, attacking_clustered, attacking_metrics):
    """
    Search for a player by name and return their attacking stats and combined attacking score.
    Args:
        player_name (str): Name of the player to search for.
        attacking_clustered (DataFrame): The DataFrame with player clustering data.
        attacking_metrics (list): The list of attacking metrics.
    Returns:
        dict or str: Player's stats and combined score, or message if player is not found.
    """
    # Filter the DataFrame for the player by name (case-insensitive)
    player_data = attacking_clustered[attacking_clustered['Name'].str.contains(player_name, case=False, na=False)]
    # If player is found, return the stats and combined score of the first match
    if not player_data.empty:
        player_stats = player_data[attacking_metrics + ['Combined Score']].iloc[0].to_dict()
        return player_stats
    return f"Player '{player_name}' not found."
# User input to search for a player
player_name = input("Enter the player's name: ")
# Get player stats and combined score
player_stats = search_player_stats(player_name, attacking_clustered, attacking_metrics)
# Display the result
if isinstance(player_stats, dict):
print(f"\nStats for {player_name}:")
# Using list comprehension for concise output
print("\n".join([f"{stat}: {value}" for stat, value in player_stats.items()]))
else:
print(player_stats)
Enter the player's name: Foden

Stats for Foden:
Goals (percentile): 95.0
Goals/90 (percentile): 73.0
Minutes/Goal (percentile): 73.0
xG (percentile): 98.0
xG/90 (percentile): 79.0
xG/Shot (percentile): 60.0
Non-pen Goals (percentile): 96.0
Non-pen Goals/90 (percentile): 71.0
Non-pen Goals/Shot (percentile): 51.0
Non-pen xG (percentile): 97.0
Non-pen xG/90 (percentile): 78.0
Shots (percentile): 98.0
Shots/90 (percentile): 81.0
Shots on Target (percentile): 98.0
Shots on Target/90 (percentile): 84.0
Shots on Target % (percentile): 82.0
Shots Outside Box/90 (percentile): 72.0
Goal Contributions (percentile): 99.0
Goal Contributions/90 (percentile): 89.0
Goals Outside Box (percentile): 0.0
Goals Outside Box/90 (percentile): 0.0
Conversion % (percentile): 55.0
Penalties Taken (percentile): 60.0
Penalties Scored (percentile): 0.0
Pens Scored % (percentile): 1.0
Combined Score: 67.6
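The same `search_player_stats` function is redefined for the creative and defensive metrics later in the notebook, differing only in argument names. A minimal sketch of one reusable version is shown here; the name `find_player_stats` and the toy data are hypothetical, and returning every match (rather than only the first) is a design choice of this sketch, not something the notebook does.

```python
import pandas as pd

def find_player_stats(player_name, clustered, metrics):
    """Return the metric columns plus 'Combined Score' for every name containing player_name."""
    # regex=False treats the query as literal text, so input such as "(" cannot break the search
    mask = clustered['Name'].str.contains(player_name, case=False, na=False, regex=False)
    return clustered.loc[mask, ['Name'] + metrics + ['Combined Score']]

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'Name': ['Phil Foden', 'Erling Haaland', None],
    'Goals (percentile)': [95.0, 100.0, 10.0],
    'xG (percentile)': [98.0, 100.0, 20.0],
})
toy_metrics = ['Goals (percentile)', 'xG (percentile)']
toy['Combined Score'] = toy[toy_metrics].mean(axis=1)

matches = find_player_stats('foden', toy, toy_metrics)
print(matches.to_string(index=False))
```

Because the result is a DataFrame, an empty result (`matches.empty`) signals "not found" without mixing `dict` and `str` return types, and a partial query such as "de" shows all candidates instead of silently picking one.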
# Define creative metrics
creative_metrics = [
    'Assists (percentile)', 'Assists/90 (percentile)', 'xA (percentile)', 'xA/90 (percentile)', 'xA Overperformance (percentile)',
    'Key Passes (percentile)', 'Key Passes/90 (percentile)', 'Key Pass % (percentile)', 'Open Play Key Passes (percentile)',
    'Chances Created (percentile)', 'Chances Created/90 (percentile)', 'Clear Cut Chances Created (percentile)',
    'Clear Cut Chances Created/90 (percentile)', 'Progressive Passes/90 (percentile)', 'Progressive Passes (percentile)',
    'Pass Completion % (percentile)', 'Crosses Attempted (percentile)', 'Crosses Attempted/90 (percentile)',
    'Crosses Completed (percentile)', 'Crosses Completed/90 (percentile)', 'Passes Attempted (percentile)',
    'Passes Attempted/90 (percentile)'
]

# Call the dynamic clustering function for creative metrics
creative_clustered, creative_centroids = dynamic_clustering(
    data=data_europe,  # Updated variable name
    features=creative_metrics,
    n_clusters=4
)

# Calculate a combined score from the creative metrics (average in this case)
creative_clustered['Combined Score'] = creative_clustered[creative_metrics].mean(axis=1)

# Sort players by combined score to get the top performers
top_5_creative = creative_clustered.nlargest(5, 'Combined Score')

# Display the results
print("Top 5 Creative Players based on combined score:")
display(top_5_creative)

# Interactive Plotly Visualization for Creative Metrics
fig2 = px.scatter(
    creative_clustered,
    x='PCA1',
    y='PCA2',
    color='Cluster',
    symbol='Division',  # Use the correct column name for league
    size='Minutes Played',
    hover_data=['Name', 'Age', 'Cluster', 'Position', 'Minutes Played', 'Division'],
    title='Interactive Creative Player Clustering'
)
fig2.update_layout(
    title_x=0.5,
    width=900,
    height=600,
    legend=dict(
        x=1.05,
        y=1,
        xanchor="left",
        yanchor="top",
        title=dict(text="Cluster")
    ),
    paper_bgcolor='rgb(243, 243, 243)',  # Light background for clean look
    plot_bgcolor='rgba(0,0,0,0)'
)

# Show the Plot
# fig2.show()

# Display Cluster Centroids for Creative Metrics
print("Creative Cluster Centroids:")
display(creative_centroids)
/Users/JumpMan/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Top 5 Creative Players based on combined score:
| | Name | Position | Age | Height | Weight | Inf | Club | Division | Nationality | Home-Grown | ... | Attacking Forward | Finisher | Aerial Threat | Reader | Assister | Ball Winning Defenders | Cluster | PCA1 | PCA2 | Combined Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 94 | Hakan Çalhanoğlu | DM, M/AM (C) | 30 | 5'10" | 152 lbs | - | Inter | Italian Serie A | TUR (GER) | - | ... | 79 | 47 | 27 | 54 | 99 | 65 | 1 | 10.514077 | -1.236828 | 93.318182 |
| 66 | Luka Modrić | DM, M/AM (C) | 38 | 5'8" | 145 lbs | - | R. Madrid | Spanish First Division | CRO | - | ... | 94 | 58 | 19 | 37 | 99 | 53 | 1 | 10.563184 | -2.395553 | 93.000000 |
| 34 | Martin Ødegaard | M/AM (C) | 25 | 5'10" | 149 lbs | - | Arsenal | English Premier Division | NOR | - | ... | 99 | 90 | 19 | 47 | 98 | 54 | 1 | 10.566436 | -1.976729 | 90.545455 |
| 585 | Pascal Groß | D (R), DM, M/AM (C) | 32 | 5'11" | 167 lbs | - | Borussia Dortmund | Bundesliga | GER | - | ... | 90 | 56 | 60 | 73 | 100 | 82 | 1 | 10.550302 | -0.704693 | 90.500000 |
| 23 | Joshua Kimmich | D/WB (R), DM, M (C) | 29 | 5'10" | 165 lbs | - | Paris SG | Ligue 1 Uber Eats | GER | - | ... | 76 | 33 | 57 | 58 | 99 | 71 | 1 | 9.941278 | -1.914790 | 89.363636 |
5 rows × 215 columns
Creative Cluster Centroids:
| | PCA1 | PCA2 |
|---|---|---|
| 0 | 2.188440 | -0.251752 |
| 1 | 6.578570 | -0.717672 |
| 2 | 0.887482 | 2.745893 |
| 3 | -2.960795 | -0.377333 |
Creative Score Search
def search_player_stats(player_name, creative_clustered, creative_metrics):
    """
    Search for a player by name and return their creative stats and combined creative score.

    Args:
        player_name (str): Name of the player to search for.
        creative_clustered (DataFrame): The DataFrame with player clustering data.
        creative_metrics (list): The list of creative metrics.

    Returns:
        dict or str: Player's stats and combined score, or message if player is not found.
    """
    # Filter the DataFrame for the player by name (case-insensitive substring match)
    player_data = creative_clustered[creative_clustered['Name'].str.contains(player_name, case=False, na=False)]
    # If player is found, return the stats and combined score of the first match
    if not player_data.empty:
        player_stats = player_data[creative_metrics + ['Combined Score']].iloc[0].to_dict()
        return player_stats
    return f"Player '{player_name}' not found."

# User input to search for a player
player_name = input("Enter the player's name: ")
# Get player stats and combined score
player_stats = search_player_stats(player_name, creative_clustered, creative_metrics)
# Display the result
if isinstance(player_stats, dict):
    print(f"\nStats for {player_name}:")
    # Using list comprehension for concise output
    print("\n".join([f"{stat}: {value}" for stat, value in player_stats.items()]))
else:
    print(player_stats)
Enter the player's name: Foden

Stats for Foden:
Assists (percentile): 99.0
Assists/90 (percentile): 89.0
xA (percentile): 99.0
xA/90 (percentile): 89.0
xA Overperformance (percentile): 99.0
Key Passes (percentile): 98.0
Key Passes/90 (percentile): 83.0
Key Pass % (percentile): 80.0
Open Play Key Passes (percentile): 99.0
Chances Created (percentile): 99.0
Chances Created/90 (percentile): 93.0
Clear Cut Chances Created (percentile): 100.0
Clear Cut Chances Created/90 (percentile): 94.0
Progressive Passes/90 (percentile): 44.0
Progressive Passes (percentile): 87.0
Pass Completion % (percentile): 34.0
Crosses Attempted (percentile): 93.0
Crosses Attempted/90 (percentile): 68.0
Crosses Completed (percentile): 81.0
Crosses Completed/90 (percentile): 51.0
Passes Attempted (percentile): 89.0
Passes Attempted/90 (percentile): 44.0
Combined Score: 82.36363636363636
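The Combined Score is a plain, equally weighted mean of the listed percentiles, so it can be checked by hand. The sketch below recomputes it from the values printed for Foden. It also shows a per-90-only mean as one possible variation; that variant is an assumption of this sketch (season totals and per-90 rates both enter the notebook's average, so volume is effectively counted twice), not something the notebook computes.

```python
import pandas as pd

# Percentiles copied from the creative search output for "Foden"
foden_creative = {
    'Assists (percentile)': 99.0, 'Assists/90 (percentile)': 89.0,
    'xA (percentile)': 99.0, 'xA/90 (percentile)': 89.0,
    'xA Overperformance (percentile)': 99.0,
    'Key Passes (percentile)': 98.0, 'Key Passes/90 (percentile)': 83.0,
    'Key Pass % (percentile)': 80.0, 'Open Play Key Passes (percentile)': 99.0,
    'Chances Created (percentile)': 99.0, 'Chances Created/90 (percentile)': 93.0,
    'Clear Cut Chances Created (percentile)': 100.0,
    'Clear Cut Chances Created/90 (percentile)': 94.0,
    'Progressive Passes/90 (percentile)': 44.0, 'Progressive Passes (percentile)': 87.0,
    'Pass Completion % (percentile)': 34.0,
    'Crosses Attempted (percentile)': 93.0, 'Crosses Attempted/90 (percentile)': 68.0,
    'Crosses Completed (percentile)': 81.0, 'Crosses Completed/90 (percentile)': 51.0,
    'Passes Attempted (percentile)': 89.0, 'Passes Attempted/90 (percentile)': 44.0,
}
row = pd.DataFrame([foden_creative])

# Same operation as the notebook: row-wise mean over all creative metrics
combined = row.mean(axis=1).iloc[0]
print(round(combined, 6))  # 82.363636, matching the printed Combined Score

# Variation (assumption): average only the per-90 rates to reduce the weight of minutes played
per90_cols = [c for c in row.columns if '/90' in c]
per90_score = row[per90_cols].mean(axis=1).iloc[0]
print(len(per90_cols), per90_score)
```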
# Define defensive metrics
defensive_metrics = [
    'Tackles Attempted (percentile)', 'Tackles Attempted/90 (percentile)', 'Tackles Completed (percentile)',
    'Tackles Completed/90 (percentile)', 'Interceptions (percentile)', 'Interceptions/90 (percentile)', 'Clearances (percentile)',
    'Clearances/90 (percentile)', 'Blocks (percentile)', 'Blocks/90 (percentile)', 'Key Tackles (percentile)',
    'Key Tackles/90 (percentile)', 'Tackle Completion % (percentile)', 'Defensive Actions/90 (percentile)',
    'Fouls Committed/90 (percentile)', 'Possession Won/90 (percentile)', 'Duels Win % (percentile)', 'Headers Won % (percentile)'
]

# Call the dynamic clustering function for defensive metrics
defensive_clustered, defensive_centroids = dynamic_clustering(
    data=data_europe,  # Updated variable name
    features=defensive_metrics,
    n_clusters=4
)

# Calculate a combined score from the defensive metrics (average in this case)
defensive_clustered['Combined Score'] = defensive_clustered[defensive_metrics].mean(axis=1)

# Sort players by combined score to get the top performers
top_5_defensive = defensive_clustered.nlargest(5, 'Combined Score')

# Display the results
print("Top 5 Defensive Players based on combined score:")
display(top_5_defensive)

# Interactive Plotly Visualization for Defensive Metrics
fig3 = px.scatter(
    defensive_clustered,
    x='PCA1',
    y='PCA2',
    color='Cluster',
    symbol='Division',  # Use the correct column name for league
    size='Minutes Played',
    hover_data=['Name', 'Age', 'Cluster', 'Position', 'Minutes Played', 'Division'],
    title='Interactive Defensive Player Clustering'
)
fig3.update_layout(
    title_x=0.5,
    width=900,
    height=600,
    legend=dict(
        x=1.05,
        y=1,
        xanchor="left",
        yanchor="top",
        title=dict(text="Cluster")
    ),
    paper_bgcolor='rgb(243, 243, 243)',  # Light background for clean look
    plot_bgcolor='rgba(0,0,0,0)'
)

# Show the Plot
# fig3.show()

# Display Cluster Centroids for Defensive Metrics
print("Defensive Cluster Centroids:")
display(defensive_centroids)
/Users/JumpMan/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
Top 5 Defensive Players based on combined score:
| | Name | Position | Age | Height | Weight | Inf | Club | Division | Nationality | Home-Grown | ... | Attacking Forward | Finisher | Aerial Threat | Reader | Assister | Ball Winning Defenders | Cluster | PCA1 | PCA2 | Combined Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1892 | Django Warmerdam | D/WB (L), DM, M (LC) | 28 | 5'11" | 160 lbs | - | Excelsior | Eredivisie | NED | Trained in nation (15-21) | ... | 48 | 54 | 95 | 99 | 44 | 88 | 0 | 8.397846 | 0.601474 | 86.055556 |
| 2903 | Bas Kuipers | D/WB (L) | 29 | 5'11" | 163 lbs | - | FC Twente | Eredivisie | NED | Trained in nation (15-21) | ... | 73 | 53 | 85 | 99 | 97 | 87 | 0 | 8.121900 | 1.558674 | 85.166667 |
| 1774 | Jesús Vázquez | D/WB/M/AM (L) | 21 | 6'0" | 174 lbs | - | Valencia | Spanish First Division | ESP | - | ... | 46 | 5 | 68 | 93 | 70 | 79 | 0 | 8.396620 | 0.663128 | 85.000000 |
| 1231 | Maximilian Mittelstädt | D/WB (L) | 27 | 5'11" | 156 lbs | - | VfB Stuttgart | Bundesliga | GER | - | ... | 36 | 5 | 61 | 96 | 80 | 86 | 0 | 8.666008 | 1.584613 | 84.444444 |
| 2928 | Boy Kemper | D (LC), WB (L), DM | 24 | 6'1" | 180 lbs | Sct | NAC Breda | Eredivisie | NED | Trained in nation (15-21) | ... | 25 | 34 | 87 | 99 | 48 | 89 | 0 | 8.050326 | 1.287554 | 83.444444 |
5 rows × 215 columns
Defensive Cluster Centroids:
| | PCA1 | PCA2 |
|---|---|---|
| 0 | 5.895340 | 1.671089 |
| 1 | -2.955537 | 0.469125 |
| 2 | 3.271433 | -1.161332 |
| 3 | 0.041772 | -1.269763 |
Defensive Score Search
def search_player_stats(player_name, defensive_clustered, defensive_metrics):
    """
    Search for a player by name and return their defensive stats and combined defensive score.

    Args:
        player_name (str): Name of the player to search for.
        defensive_clustered (DataFrame): The DataFrame with player clustering data.
        defensive_metrics (list): The list of defensive metrics.

    Returns:
        dict or str: Player's stats and combined score, or message if player is not found.
    """
    # Filter the DataFrame for the player by name (case-insensitive substring match)
    player_data = defensive_clustered[defensive_clustered['Name'].str.contains(player_name, case=False, na=False)]
    # If player is found, return the stats and combined score of the first match
    if not player_data.empty:
        player_stats = player_data[defensive_metrics + ['Combined Score']].iloc[0].to_dict()
        return player_stats
    return f"Player '{player_name}' not found."

# User input to search for a player
player_name = input("Enter the player's name: ")
# Get player stats and combined score
player_stats = search_player_stats(player_name, defensive_clustered, defensive_metrics)
# Display the result
if isinstance(player_stats, dict):
    print(f"\nStats for {player_name}:")
    # Using list comprehension for concise output
    print("\n".join([f"{stat}: {value}" for stat, value in player_stats.items()]))
else:
    print(player_stats)
Enter the player's name: Phil Foden

Stats for Phil Foden:
Tackles Attempted (percentile): 98.0
Tackles Attempted/90 (percentile): 77.0
Tackles Completed (percentile): 97.0
Tackles Completed/90 (percentile): 69.0
Interceptions (percentile): 89.0
Interceptions/90 (percentile): 42.0
Clearances (percentile): 75.0
Clearances/90 (percentile): 26.0
Blocks (percentile): 82.0
Blocks/90 (percentile): 40.0
Key Tackles (percentile): 0.0
Key Tackles/90 (percentile): 0.0
Tackle Completion % (percentile): 22.0
Defensive Actions/90 (percentile): 39.0
Fouls Committed/90 (percentile): 51.0
Possession Won/90 (percentile): 42.0
Duels Win % (percentile): 15.0
Headers Won % (percentile): 13.0
Combined Score: 48.72222222222222
Cluster Plane Search
# pip install fuzzywuzzy python-Levenshtein
import pandas as pd
import numpy as np
from fuzzywuzzy import process
from sklearn.metrics import pairwise_distances
from IPython.display import display

def find_similar_players(player_name, cluster_data, n=5):
    """
    Finds the top `n` most similar players to the given player based on their position
    in the PCA1-PCA2 cluster plane.

    Parameters:
    - player_name (str): The name of the player to search for.
    - cluster_data (DataFrame): The DataFrame containing the cluster assignments, PCA1, PCA2, and player names.
    - n (int): The number of similar players to return (default is 5).

    Returns:
    - tuple: (DataFrame with the top `n` most similar players, the matched player name).
    """
    # Resolve the query to the closest player name in the data
    correct_name = fuzzy_search(player_name, cluster_data)
    if not correct_name:
        raise ValueError("No close match found for the player name.")
    # Get the player's PCA coordinates
    player_row = cluster_data[cluster_data['Name'] == correct_name]
    player_pca1 = player_row['PCA1'].values[0]
    player_pca2 = player_row['PCA2'].values[0]
    # Work on a copy of the other players so adding a column does not trigger SettingWithCopyWarning
    other_players = cluster_data[cluster_data['Name'] != correct_name].copy()
    # Calculate the Euclidean distance between the player's PCA coordinates and all other players
    distances = np.sqrt((other_players['PCA1'] - player_pca1)**2 + (other_players['PCA2'] - player_pca2)**2)
    # Add distances to the dataframe
    other_players['Distance'] = distances
    # Sort by distance and return the top n most similar players
    similar_players = other_players.nsmallest(n, 'Distance')[['Name', 'Distance', 'PCA1', 'PCA2']]
    return similar_players, correct_name

def fuzzy_search(query, cluster_data):
    """
    Performs a fuzzy search on the player names and returns the best match.

    Parameters:
    - query (str): The name of the player to search for.
    - cluster_data (DataFrame): The DataFrame containing player names.

    Returns:
    - str: The closest matching player name, or None if no close match is found.
    """
    # Get the list of player names
    player_names = cluster_data['Name'].tolist()
    # Use fuzzywuzzy to find the closest match
    match, score = process.extractOne(query, player_names)
    # Only return the match if the score is high enough (score >= 80)
    if score >= 80:
        return match
    else:
        return None

# Example usage:
# Assuming `attacking_clustered` is your DataFrame containing player names and PCA1, PCA2 coordinates.
# Run a dynamic search while the code is running
player_name = input("Enter the player's name: ")  # User input
try:
    similar_players, correct_name = find_similar_players(player_name, attacking_clustered)
    display(f"Did you mean: {correct_name}?")  # Display the player name suggestion
    display(similar_players)  # Display the DataFrame with the top similar players
except ValueError as e:
    display(e)  # Display the error if no match is found
Enter the player's name: Phil Foden
'Did you mean: Phil Foden?'
| | Name | Distance | PCA1 | PCA2 |
|---|---|---|---|---|
| 1061 | Robert Andrich | 0.030948 | 3.464052 | -1.429027 |
| 2466 | Yannik Engelhardt | 0.045184 | 3.525816 | -1.410203 |
| 2569 | Pelle Clement | 0.071949 | 3.439708 | -1.489610 |
| 796 | Harry Winks | 0.075548 | 3.550945 | -1.488922 |
| 1941 | Isaac Hayden | 0.094428 | 3.586660 | -1.449655 |
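Which plane this search measures depends on which frame `attacking_clustered` actually refers to. `dynamic_clustering` is not shown in this part of the notebook, but the statistical search output further down lists `Cluster`, `PCA1`, `PCA2` and `Combined Score` among the columns of `data_europe`, which suggests the function writes its results into the frame it receives. If it also returns that same frame, then `attacking_clustered`, `creative_clustered` and `defensive_clustered` would all name one object holding only the most recent run's coordinates. The sketch below illustrates the difference with two hypothetical stand-ins, `cluster_in_place` and `cluster_on_copy`; neither is the notebook's real function, and the "PCA1" they compute is only a placeholder.

```python
import pandas as pd

def cluster_in_place(data, features):
    # Hypothetical stand-in: writes into the caller's frame and returns that same frame
    data['PCA1'] = data[features].mean(axis=1)  # placeholder for a real PCA coordinate
    return data

def cluster_on_copy(data, features):
    # Same placeholder computation, but on an independent copy of the input
    result = data.copy()
    result['PCA1'] = result[features].mean(axis=1)
    return result

players = pd.DataFrame({'Name': ['A', 'B'], 'att': [90.0, 10.0], 'def': [20.0, 80.0]})
shared_att = cluster_in_place(players, ['att'])
shared_def = cluster_in_place(players, ['def'])
print(shared_att is shared_def)     # True: both names point at the same frame
print(shared_att['PCA1'].tolist())  # [20.0, 80.0]: the "attacking" frame now holds defensive values

base = pd.DataFrame({'Name': ['A', 'B'], 'att': [90.0, 10.0], 'def': [20.0, 80.0]})
own_att = cluster_on_copy(base, ['att'])
own_def = cluster_on_copy(base, ['def'])
print(own_att['PCA1'].tolist())     # [90.0, 10.0]: unaffected by the later defensive run
```

A quick check in the notebook would be `attacking_clustered is defensive_clustered`. If that evaluates to `True`, rerunning the attacking clustering immediately before the search, or having the clustering function work on `data.copy()`, would keep the three planes separate.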
Statistical Player Search
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.impute import SimpleImputer

def find_similar_players_statistics(player_name, data, n=6):
    """
    Finds the `n` players closest to the given player based on their statistics.
    The searched player is always returned first (distance 0), so the default
    n=6 yields the player plus their 5 most similar players.

    Parameters:
    - player_name (str): The exact name of the player to search for.
    - data (DataFrame): The dataset containing player statistics.
    - n (int): The number of rows to return, including the searched player (default is 6).

    Returns:
    - DataFrame: The `n` closest players, sorted by distance.
    """
    # Exclude columns that are not relevant for the similarity (e.g., 'Games', 'Minutes Played')
    excluded_columns = ['Games', 'Minutes Played']
    data_relevant = data.drop(columns=excluded_columns, errors='ignore')
    # Fix the numeric feature columns before 'Distance' is added, so it is not listed twice
    numeric_columns = list(data_relevant.select_dtypes(include=[np.number]).columns)
    # Retrieve the player's row by exact name
    player_rows = data_relevant[data_relevant['Name'] == player_name]
    if player_rows.empty:
        raise ValueError(f"Player '{player_name}' not found.")
    # Impute missing values with the column mean (or use 'median')
    imputer = SimpleImputer(strategy='mean')
    data_imputed = imputer.fit_transform(data_relevant[numeric_columns])
    player_stats_imputed = imputer.transform(player_rows[numeric_columns].iloc[[0]])
    # Standardize the data (only numeric columns)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data_imputed)
    # Scale the input player's statistics
    player_stats_scaled = scaler.transform(player_stats_imputed)
    # Calculate Euclidean distances between the input player and all players
    distances = euclidean_distances(player_stats_scaled, data_scaled)
    # Add distances to the dataset
    data_relevant['Distance'] = distances[0]
    # Sort by distance and return the n closest players
    similar_players = data_relevant.nsmallest(n, 'Distance')[['Name', 'Distance'] + numeric_columns]
    return similar_players

# Example usage: Input player name for comparison
player_name = input("Enter the name of the player you want to search for: ")
# Assuming `data_europe` is your DataFrame containing player statistics
similar_players = find_similar_players_statistics(player_name, data_europe)
print(f"Top 5 Most Similar Players to {player_name}:")
display(similar_players)
Enter the name of the player you want to search for: Erling Haaland
Top 5 Most Similar Players to Erling Haaland:
| | Name | Distance | Age | Starts | Average Rating | Sub Appearances | Minutes/Game | Goals (percentile) | Goals/90 (percentile) | Minutes/Goal (percentile) | ... | Finisher | Aerial Threat | Reader | Assister | Ball Winning Defenders | Cluster | PCA1 | PCA2 | Combined Score | Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Erling Haaland | 0.000000 | 23 | 43 | 7.44 | 0 | 69.21 | 100 | 97 | 96 | ... | 99 | 76 | 37 | 91 | 52 | 3 | 1.429782 | -1.227910 | 30.611111 | 0.000000 |
| 6 | Harry Kane | 7.838198 | 30 | 38 | 7.42 | 0 | 80.71 | 100 | 97 | 96 | ... | 99 | 77 | 36 | 86 | 53 | 2 | 1.333988 | -0.957876 | 31.611111 | 7.838198 |
| 0 | Kylian Mbappé | 8.319176 | 25 | 49 | 7.60 | 0 | 82.69 | 100 | 98 | 97 | ... | 99 | 46 | 29 | 97 | 37 | 2 | 0.994908 | -0.788362 | 28.777778 | 8.319176 |
| 7 | Robert Lewandowski | 8.649842 | 35 | 49 | 7.24 | 0 | 88.00 | 100 | 91 | 91 | ... | 98 | 71 | 28 | 88 | 25 | 2 | 1.102815 | 0.134810 | 30.222222 | 8.649842 |
| 153 | Victor Boniface | 8.682437 | 23 | 35 | 7.15 | 2 | 69.76 | 99 | 94 | 93 | ... | 97 | 74 | 29 | 50 | 28 | 3 | 0.754453 | -1.242610 | 28.222222 | 8.682437 |
| 55 | Romelu Lukaku | 9.040984 | 31 | 34 | 7.16 | 0 | 82.50 | 99 | 94 | 93 | ... | 98 | 70 | 32 | 96 | 13 | 2 | 1.598513 | -0.421074 | 31.555556 | 9.040984 |
6 rows × 202 columns
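The standardize-then-measure step behind this search can be sketched with NumPy alone. The function name `nearest_by_stats` and the toy data are hypothetical; the sketch mean-imputes and z-scores each column (population standard deviation, as `StandardScaler` does) and drops the searched player from the result, so `n=5` returns five other players without the `n=6` workaround.

```python
import numpy as np
import pandas as pd

def nearest_by_stats(player_name, data, feature_cols, n=5):
    """Return the n players closest to player_name in z-scored feature space."""
    mask = (data['Name'] == player_name).to_numpy()
    if not mask.any():
        raise ValueError(f"Player '{player_name}' not found.")
    pos = int(np.argmax(mask))  # row position of the first exact match

    values = data[feature_cols].to_numpy(dtype=float)
    # Mean-impute missing values column by column
    col_means = np.nanmean(values, axis=0)
    values = np.where(np.isnan(values), col_means, values)
    # Z-score each column; constant columns get std 1 to avoid division by zero
    std = values.std(axis=0)
    std[std == 0] = 1.0
    scaled = (values - values.mean(axis=0)) / std

    # Euclidean distance from the searched player to every row
    dist = np.linalg.norm(scaled - scaled[pos], axis=1)
    out = data[['Name']].assign(Distance=dist)
    return out[out['Name'] != player_name].nsmallest(n, 'Distance')

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'goals': [10.0, 9.0, 2.0, np.nan],
    'tackles': [1.0, 2.0, 9.0, 8.0],
})
result = nearest_by_stats('A', toy, ['goals', 'tackles'], n=3)
print(result)
```

Passing an explicit `feature_cols` list also keeps derived columns such as `Cluster`, `PCA1`, `PCA2` and `Combined Score` out of the distance, which the select-all-numeric approach does not.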
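Cluster Plane Search
Method: Euclidean distance between players' (PCA1, PCA2) coordinates.
Basis of Similarity: position in the two-dimensional PCA plane stored in the clustered DataFrame, which compresses one group of metrics into two components.
Results: for Phil Foden, the five nearest players were Robert Andrich, Yannik Engelhardt, Pelle Clement, Harry Winks and Isaac Hayden.
Statistical Player Search
Method: Euclidean distance between players' standardized statistics, after mean imputation of missing values.
Basis of Similarity: every numeric column in the dataset except 'Games' and 'Minutes Played'.
Results: for Erling Haaland, the five nearest players were Harry Kane, Kylian Mbappé, Robert Lewandowski, Victor Boniface and Romelu Lukaku.
Cluster Search: compares players on two derived components.
Statistical Search: compares players on the full set of numeric statistics.
Dimensionality vs. Specificity: two components give a coarse view that is easy to plot, while the full statistical profile is more specific but harder to visualize.
Clustered Data Grouping: the cluster search depends on the PCA coordinates produced by the clustering step, so its results change with the group of metrics that was clustered.
These methods answer different kinds of questions:
Cluster search: which players occupy a similar style or role?
Statistical search: which players have produced the most similar numbers?
As a result, the two searches return different players: the cluster search reflects style or role similarity, while the statistical search reflects actual performance across the recorded statistics.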
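A small numerical sketch of why a two-component plane and the full feature space can disagree about who is "nearest". It uses random, hypothetical data and computes PCA with NumPy's SVD rather than the notebook's own clustering code, and it uses the same features for both views, which is a simplification. Projecting onto orthonormal components can only shrink distances, so two players can look close in the plane while being far apart on the metrics the plane discards.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # 50 hypothetical players, 10 standardized metrics
X = X - X.mean(axis=0)          # center each metric before PCA

# PCA via SVD: the rows of Vt are orthonormal principal directions
_, _, Vt = np.linalg.svd(X, full_matrices=False)
plane = X @ Vt[:2].T            # coordinates on the first two components

query = 0
d_full = np.linalg.norm(X - X[query], axis=1)           # distance on all 10 metrics
d_plane = np.linalg.norm(plane - plane[query], axis=1)  # distance in the 2D plane

# Distances in the plane never exceed distances in the full space
print(bool(np.all(d_plane <= d_full + 1e-9)))

# Five nearest neighbours under each view (position 0 of the sort is the query itself)
nearest_full = np.argsort(d_full)[1:6]
nearest_plane = np.argsort(d_plane)[1:6]
print("full space:", nearest_full)
print("PCA plane :", nearest_plane)
```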